Extraction Routines in AutoMap

- Stemming: Convert words into their morphemes.
- Reduction and Normalization:
    - Negative filters, such as delete lists and removal of symbols
    - Positive filters, such as spelling correction and assigning synonyms
      to a unique key concept
- Part of Speech Tagging: Assign a single best word class to every word.
- Anaphora Resolution: Convert personal pronouns into the entity or
  entities that the pronoun refers to.
- Feature Identification: Automatically find the most important terms in
  a dataset.
- Named Entity Extraction: Identify relevant types of information that
  are referred to by a name, such as people, organizations, and locations.
- Ontological Text Coding: Identify and classify instances of pre- or
  user-defined node classes, such as Named Entities, resources, tasks,
  and time.
- Identification of and reasoning about node and edge attributes, such as
  demographic data, beliefs, and types of relationships.
- Email Data Analysis: Extract and combine different types of networks,
  such as social networks and knowledge networks, from emails.
- Entropy Assessment: Determine the variability of a text or text set
  with respect to its vocabulary.
- Classical Content Analysis.
- Read and write data and processing material from and to a default or
  user-specified database.

[Figure: workflow diagram with the labels Coding Choices; AutoMap: Extract
Relational Data; Network Analysis; Visualization; Simulation]

Illustrative Toy Example:

“Jan Pronk, the Special Representative of Secretary-General Kofi Annan to
Sudan, today called for the immediate return of the vehicles to World Food
Programme (WFP) and NGOs.” (from UN News Service, New York, 12-28-2004)

Proximity-based extraction of relational data:

[Figure: the sentence rendered as two networks over the nodes Jan Pronk,
Kofi Annan, Sudan, vehicles, WFP, and NGOs. Left panel: one node type
(Knowledge). Right panel: multiple entity classes (Person, Organization,
Location, Resource).]

For relational data with at least one node type: Locate/identify relevant
nodes (may be multi-word units).

Classification: For ontologically coded networks: Classify relevant nodes
according to an ontology or taxonomy.
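
The proximity-based extraction named in the toy example can be sketched roughly as follows. This is a minimal illustration, not AutoMap's actual implementation: it assumes the sentence has already been tokenized with multi-word units joined, that non-concept words are dropped before windowing, and that two concepts are linked when they fall inside the same sliding window. The function name `proximity_edges`, the `window` parameter, and the assignment of nodes to entity classes are all hypothetical.

```python
from collections import Counter

# Toy sentence, pre-tokenized; multi-word units are joined by "_" (assumption).
tokens = ["Jan_Pronk", "the", "Special_Representative", "of",
          "Secretary-General", "Kofi_Annan", "to", "Sudan", "today",
          "called", "for", "the", "immediate", "return", "of", "the",
          "vehicles", "to", "WFP", "and", "NGOs"]

# Concepts kept after applying a delete list (the nodes shown in the figure).
CONCEPTS = {"Jan_Pronk", "Kofi_Annan", "Sudan", "vehicles", "WFP", "NGOs"}

# Hypothetical mapping of nodes to the entity classes named in the legend.
ENTITY_CLASS = {"Jan_Pronk": "Person", "Kofi_Annan": "Person",
                "Sudan": "Location", "vehicles": "Resource",
                "WFP": "Organization", "NGOs": "Organization"}


def proximity_edges(tokens, concepts, window=2):
    """Count undirected links between concepts that fall in one window.

    The window slides over the retained concepts only, so window=2 links
    each concept to the next retained concept.
    """
    kept = [t for t in tokens if t in concepts]
    edges = Counter()
    for start in range(len(kept)):
        span = kept[start:start + window]
        head = span[0]
        for other in span[1:]:
            if other != head:
                edges[tuple(sorted((head, other)))] += 1
    return edges


edges = proximity_edges(tokens, CONCEPTS, window=2)

# One node type: every node is simply a concept.
# Multiple entity classes: the same edges, typed by the class of each end.
typed_edges = {pair: (ENTITY_CLASS[pair[0]], ENTITY_CLASS[pair[1]])
               for pair in edges}

for pair, weight in sorted(edges.items()):
    print(pair, weight, typed_edges[pair])
```

Widening the window adds links between concepts that are farther apart, which is the main coding choice this kind of extraction exposes.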

























Development of Computational Solutions

- Utilize techniques from Machine Learning and Artificial Intelligence.
- Deploy and develop supervised and semi-supervised sequential stochastic
  learning techniques in order to train classifiers and build models that
  generalize to new data.
- Construct a classifier h that, trained on sequences of (x, y) pairs
  (where x = words per sequence and y = corresponding categories), models
  either the joint probability P(x, y) or the conditional probability
  P(y|x), and predicts a label sequence y = h(x) for any sequence x,
  including new and unseen data.
- We work with generative models, P(x, y), such as the Hidden Markov Model
  (HMM), and conditional (also known as discriminative) models, P(y|x),
  such as Maximum Entropy Markov Models (MEMM) and Conditional Random
  Fields (CRF).
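
To make the generative case concrete, the sketch below decodes a label sequence y = h(x) with a tiny Hidden Markov Model using the Viterbi algorithm. It is an illustration only: the two states, all probabilities, and the `floor` value for unseen words are invented for the example and are not trained on any data or taken from AutoMap.

```python
import math


def viterbi(words, states, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable state sequence under an HMM (log space)."""
    def emit(state, word):
        # Unseen words get a small floor probability (assumption).
        return math.log(emit_p[state].get(word, floor))

    # best[t][s] = (log-probability of best path ending in s, backpointer)
    best = [{s: (math.log(start_p[s]) + emit(s, words[0]), None)
             for s in states}]
    for word in words[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            score, arg = max((prev[r][0] + math.log(trans_p[r][s]), r)
                             for r in states)
            column[s] = (score + emit(s, word), arg)
        best.append(column)

    # Follow backpointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return path[::-1]


# Hand-set toy parameters (hypothetical, for illustration only).
states = ("ENTITY", "OTHER")
start_p = {"ENTITY": 0.3, "OTHER": 0.7}
trans_p = {"ENTITY": {"ENTITY": 0.5, "OTHER": 0.5},
           "OTHER": {"ENTITY": 0.3, "OTHER": 0.7}}
emit_p = {"ENTITY": {"Jan": 0.3, "Pronk": 0.3, "Sudan": 0.3},
          "OTHER": {"visited": 0.5, "the": 0.4}}

labels = viterbi(["Jan", "Pronk", "visited", "Sudan"],
                 states, start_p, trans_p, emit_p)
print(labels)
```

A conditional model such as a CRF keeps the same decoding idea but replaces the emission and transition probabilities with feature-based potentials that may look at the whole input sequence.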

















Example: Conditional Random Fields for Entity Extraction

Identify and classify words that represent instances of entity classes of
models or ontologies that deviate from the classical set of Named Entities.

- Crucial step for coding texts as socio-technical networks according to
  domain-specific ontologies and for advanced modeling of complex and
  dynamic real-world organizations or networks.
- Model the relationship among adjacent labels y_i and y_{i-1} as a Markov
  Random Field conditioned on x.
- The conditional distribution of the entity sequence y given the
  observation sequence x is computed as a normalized product of potential
  functions M_i.
- Conditional probability of the label sequence, P(y|x), where both x and
  y are arbitrarily long vectors. The model can consider an arbitrarily
  large bag of features (> 10,000) and any property of x, such as
  long-distance information.
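
The normalized product of potentials can be sketched as below. The poster's own equation did not survive, so the standard linear-chain form is assumed here: P(y|x) = (1/Z(x)) * prod_i M_i(y_{i-1}, y_i | x), with M_i = exp(sum_k lambda_k * f_k(y_{i-1}, y_i, x, i)). The label set, the feature functions, and the weights are hypothetical and hand-set for illustration, not learned.

```python
import math
from itertools import product

LABELS = ("PERSON", "LOCATION", "OTHER")
START = "<s>"  # dummy label preceding the first position

# Hand-set weights lambda_k (hypothetical; a real CRF learns these).
WEIGHTS = {"cap&PERSON": 1.2, "cap&LOCATION": 1.0, "lower&OTHER": 2.0,
           "prev_to&LOCATION": 1.5, "PERSON->PERSON": 0.5}


def features(y_prev, y, x, i):
    """Feature functions f_k; each may inspect any part of x."""
    word = x[i]
    return {
        "cap&PERSON": float(word[0].isupper() and y == "PERSON"),
        "cap&LOCATION": float(word[0].isupper() and y == "LOCATION"),
        "lower&OTHER": float(word[0].islower() and y == "OTHER"),
        "prev_to&LOCATION": float(i > 0 and x[i - 1] == "to"
                                  and y == "LOCATION"),
        "PERSON->PERSON": float(y_prev == "PERSON" and y == "PERSON"),
    }


def potential(y_prev, y, x, i):
    """M_i(y_prev, y | x) = exp(sum_k lambda_k * f_k)."""
    return math.exp(sum(WEIGHTS[k] * v
                        for k, v in features(y_prev, y, x, i).items()))


def partition(x):
    """Normalizer Z(x), computed with the forward recursion."""
    alpha = {y: potential(START, y, x, 0) for y in LABELS}
    for i in range(1, len(x)):
        alpha = {y: sum(alpha[p] * potential(p, y, x, i) for p in LABELS)
                 for y in LABELS}
    return sum(alpha.values())


def prob(y, x):
    """P(y | x) as the normalized product of potentials."""
    score, prev = 1.0, START
    for i, label in enumerate(y):
        score *= potential(prev, label, x, i)
        prev = label
    return score / partition(x)


x = ["Pronk", "went", "to", "Sudan"]
candidates = list(product(LABELS, repeat=len(x)))
total = sum(prob(y, x) for y in candidates)   # should sum to 1
best = max(candidates, key=lambda y: prob(y, x))
print(best, round(prob(best, x), 3))
```

Enumerating every labeling is only feasible for toy inputs; it is used here to check that the forward recursion normalizes the distribution correctly.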











Evaluation

- Rigorous assessment of the impact of information and relation extraction
  techniques on relational data and the respective interpretations of
  socio-technical networks.
- Example: Impact of anaphora resolution (AR) and coreference resolution
  (CR):

Table: Impact of AR, CR on edge level

Routine      Measurement                  Newswire  Newspaper  Broadcast
raw          unique nodes                     4715       4884       3743
raw          total node weight                5774       5916       4536
AR           unique nodes                     4599       4682       3659
AR           node weight reduction rate       2.5%       4.1%       2.2%
[illegible]  unique nodes                     3324       3213       2835
[illegible]  node weight reduction rate      29.5%      34.2%      24.3%
[missing]    unique nodes                     3050       2894       2596
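
As a quick check on the table, the reported AR rates can be reproduced from the unique-node counts. This is an observation rather than a statement of the authors' method: the table labels the figure a node weight reduction rate, but the values appear to match the relative drop in unique nodes against the raw counts. The calculation takes the newspaper AR count as 4682; the function name is hypothetical.

```python
raw = {"newswire": 4715, "newspaper": 4884, "broadcast": 3743}
after_ar = {"newswire": 4599, "newspaper": 4682, "broadcast": 3659}


def reduction_rate(before, after):
    """Relative reduction in percent, rounded to one decimal place."""
    return round(100.0 * (before - after) / before, 1)


rates = {source: reduction_rate(raw[source], after_ar[source])
         for source in raw}
print(rates)
```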